FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems

Authors

Abstract

Training state-of-the-art artificial intelligence (AI) models requires scaling to many compute nodes and relies heavily on collective communication operations, such as all-reduce, to exchange the weight gradients between nodes. The overhead of these operations can bottleneck training performance as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead. Then, we propose a new smart network interface card (NIC) for distributed AI training using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize network bandwidth utilization via data compression. The smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases overall node-to-node communication efficiency. We build a prototype 6-node system and show that our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6×, with an estimated 2.5× improvement at 32 nodes.
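For readers unfamiliar with the all-reduce collective named in the abstract, the following is a minimal pure-Python sketch of one common software formulation, a ring all-reduce (reduce-scatter followed by all-gather). It is an illustration only and not the paper's FPGA design. The function name `ring_all_reduce` and the in-memory "nodes" are assumptions made for this example, and the sketch assumes the gradient length divides evenly by the node count.

```python
def ring_all_reduce(grads):
    """Simulate a ring all-reduce (element-wise sum) over per-node gradients.

    grads: list of equal-length lists, one per simulated node.
    Returns one list per node; after the call, all of them hold the sum.
    Hypothetical helper for illustration, not an API from the paper.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    c = size // n
    # Each node's buffer is split into n chunks of c elements.
    bufs = [[list(g[k * c:(k + 1) * c]) for k in range(n)] for g in grads]

    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) mod n
    # to its ring neighbour, which accumulates it. After n - 1 steps,
    # node i holds the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing data so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, list(bufs[i][(i - step) % n]))
                 for i in range(n)]
        for src, k, data in sends:
            dst = (src + 1) % n
            bufs[dst][k] = [a + b for a, b in zip(bufs[dst][k], data)]

    # Phase 2: all-gather. Fully reduced chunks circulate around the ring,
    # overwriting the stale partial sums on each receiver.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(bufs[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, k, data in sends:
            dst = (src + 1) % n
            bufs[dst][k] = data

    # Flatten each node's chunks back into a single vector.
    return [[x for chunk in b for x in chunk] for b in bufs]


if __name__ == "__main__":
    result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(result[0])
```

In this formulation each node sends and receives 2(n - 1) chunks of size 1/n of the gradient vector, which gives a sense of the communication cost the abstract refers to.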

Similar resources

Efficient Techniques for Distributed Implementation of Search-Based AI Systems

We study the problem of exploiting parallelism from search-based AI systems on distributed machines. We propose stack-splitting, a technique for implementing or-parallelism, which when coupled with appropriate scheduling strategies leads to: (i) reduced communication during distributed execution; and, (ii) distribution of larger grain-sized work to processors. The modified technique can also be i...

Distributed Control for AI

This paper discusses a number of elementary problems in distributed computing and a couple of well-known algorithmic "building blocks", which are used as procedures in distributed applications. We shall not strive for completeness, as an enumeration of the many known distributed algorithms would be pointless and endless. We do not even try to touch all relevant sub-areas and problems studied in...

Metadatabase Meets Distributed AI

Heterogeneous Distributed Database Management Systems (HDDBMS) involve the interoperability of data sources. One approach to achieve this type of integration is to build interfaces between the different databases being integrated. This approach holds, for a particular case, at a specific point in time. In this case however, the database structures need to be adapted. Such adaptation is not advisa...

AI-based Trading in Open Distributed Environments

An open distributed environment can be perceived as a service market where services are freely offered and requested. Any infrastructure which seeks to provide appropriate mechanisms for such an environment has to include mediator functionality (i.e. a trader) that matches service requests and service offers. Commonly, the matching process is based upon some IDL-based service type definition, and ...

Journal

Journal title: IEEE Computer Architecture Letters

Year: 2022

ISSN: 2473-2575, 1556-6056, 1556-6064

DOI: https://doi.org/10.1109/lca.2022.3189207